ABSTRACT

Asymmetric processors have emerged as an appealing technology
for severely energy-constrained environments, especially
in the mobile market, where heterogeneity in applications
is mainstream. In addition, given the growing interest
in ultra-low-power architectures for high-performance
computing, platforms of this type are also being investigated
on the road towards energy-efficient high-performance
scientific applications. In this paper, we propose a first
step towards a complete implementation of the BLAS interface
adapted to asymmetric ARM big.LITTLE processors, analyzing
the trade-offs between performance and energy efficiency
when compared to existing homogeneous (symmetric)
multi-threaded BLAS implementations. Our experimental
results reveal important gains in performance while
maintaining the energy efficiency of homogeneous solutions
by efficiently exploiting all the resources of the
asymmetric processor.

Categories and Subject Descriptors 

C.1.3 [Computer Systems Organization]: Other Architecture
Styles - heterogeneous (hybrid) systems; C.4 [Performance
of Systems]: Performance and energy efficiency;
G.4 [Mathematical Software]: Efficiency

1. INTRODUCTION 

The decay of Dennard scaling [4] during the past decade
marked the end of the “GHz race” and the shift towards
multicore designs due to their more favorable
performance-energy ratio. In addition, the doubling of
transistors on chip with each new semiconductor generation,
dictated by Moore’s law m, has only exacerbated the power
wall problem 0 1121114| , leading to the rise of “dark
silicon” [6] and the deployment of heterogeneous facilities
for high performance computing.

Asymmetric multicore processors (AMPs) are a particular
class of heterogeneous architectures equipped with cores that
share the same instruction set architecture^1 but differ in
performance, complexity, and power consumption. AMPs have
recently received considerable attention as a means to
improve the performance-energy ratio of computing systems [9l
mnsi I20 |, mainly by exploiting the presence of serial and
parallel phases within applications.

In this paper we investigate the practical
performance-power-energy balance of ARM’s asymmetric
big.LITTLE technology, employing as a case study the
compute-intensive general matrix multiplication (GEMM):
C += A · B, where the sizes of A, B, C are respectively
m × k, k × n, m × n. Most previous related work targets
the parallelization of GEMM on
i) distributed-memory heterogeneous architectures (see ElE]
and references therein); or ii) asymmetric multicores, but
using trivial (unoptimized) implementations of GEMM [lllllOj .
Compared with these other efforts, our paper makes the
following contributions: First, we leverage a static mapping
of threads and we propose a workload partitioning strategy
of the BLIS implementation of GEMM specifically tailored
for the Exynos 5422 big.LITTLE architecture, a
system-on-chip (SoC) featuring two processing clusters: an ARM
Cortex-A15 quad core and a Cortex-A7 quad core. Second,
we perform a detailed evaluation of our solution in terms of
performance compared with that of the symmetric counterpart
on each of the processing clusters of the Exynos 5422.
Third, we perform an energy efficiency evaluation of each


^1 According to this definition, servers equipped with one (or
more) general-purpose multicore processor(s) and a
PCIe-attached graphics accelerator, or systems-on-chip like the
NVIDIA Tegra TK1, are excluded from this category.



Loop 1  for jc = 0, ..., n - 1 in steps of nc
Loop 2    for pc = 0, ..., k - 1 in steps of kc
            B(pc : pc + kc - 1, jc : jc + nc - 1) → Bc          // Pack into Bc
Loop 3      for ic = 0, ..., m - 1 in steps of mc
              A(ic : ic + mc - 1, pc : pc + kc - 1) → Ac        // Pack into Ac
Loop 4        for jr = 0, ..., nc - 1 in steps of nr            // Macro-kernel
Loop 5          for ir = 0, ..., mc - 1 in steps of mr
                  Cc(ir : ir + mr - 1, jr : jr + nr - 1)        // Micro-kernel
                    += Ac(ir : ir + mr - 1, 0 : kc - 1)
                       · Bc(0 : kc - 1, jr : jr + nr - 1)
                endfor
              endfor
            endfor
          endfor
        endfor

Figure 1: High performance implementation of GEMM in BLIS. In the code, Cc = C(ic : ic + mc - 1, jc : jc + nc - 1)
is just a notation artifact, introduced to ease the presentation of the algorithm, while Ac, Bc correspond to
actual buffers that are involved in data copies.


approach using the GFLOPS/W metric (equivalent to billions
of floating-point arithmetic operations, or flops, per
Joule).

2. MATRIX MULTIPLICATION FOR 
GENERAL-PURPOSE PROCESSORS 

Modern implementations of GEMM for general-purpose
architectures, including BLIS and OpenBLAS, follow the
approach pioneered by GotoBLAS [7]. Concretely, BLIS
implements GEMM as three nested loops around a macro-kernel
plus two packing routines (see Loops 1-3 in Figure 1). The
macro-kernel is then implemented in terms of two additional
loops around a micro-kernel (Loops 4 and 5 in Figure 1). In
BLIS, the micro-kernel is typically implemented as a loop
around a rank-1 (i.e., outer product) update using assembly
or vector intrinsics, while the remaining five loops are
implemented in C; see m for further details. Furthermore,
the BLIS (cache) optimization parameters nc, kc, mc, nr
and mr are adjusted taking into account the latencies of the
floating-point units (FPUs), the number of vector registers,
and the size/associativity degree of the cache levels. The
goal is that Ac and a narrow column panel of Bc, say Br, are
fed into the floating-point units from the L2 and L1 caches,
respectively, and that these transfers are fully amortized
with enough computation from within the micro-kernel; see [13].
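To make the loop structure of Figure 1 concrete, the following is a minimal sketch in plain Python. It is not BLIS code: the function name `gemm_five_loops`, the tiny default block sizes, and the list-of-lists matrix representation are assumptions made for illustration only, and the packing steps are reduced to plain copies. The micro-kernel is written as a sequence of rank-1 updates, as described above.

```python
def gemm_five_loops(A, B, C, nc=8, kc=4, mc=6, nr=2, mr=3):
    """Update C += A * B in place following the five-loop blocking of Figure 1.

    A is m x k, B is k x n, C is m x n (lists of lists).
    Block sizes are illustrative defaults, not tuned values.
    """
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                       # Loop 1
        nc_eff = min(nc, n - jc)
        for pc in range(0, k, kc):                   # Loop 2
            kc_eff = min(kc, k - pc)
            # Pack a kc x nc block of B into the buffer Bc.
            Bc = [row[jc:jc + nc_eff] for row in B[pc:pc + kc_eff]]
            for ic in range(0, m, mc):               # Loop 3
                mc_eff = min(mc, m - ic)
                # Pack an mc x kc block of A into the buffer Ac.
                Ac = [row[pc:pc + kc_eff] for row in A[ic:ic + mc_eff]]
                for jr in range(0, nc_eff, nr):      # Loop 4 (macro-kernel)
                    for ir in range(0, mc_eff, mr):  # Loop 5
                        # Micro-kernel: kc rank-1 updates of an mr x nr block.
                        # Cc is only notation, so C is indexed directly.
                        for p in range(kc_eff):
                            for i in range(ir, min(ir + mr, mc_eff)):
                                a = Ac[i][p]
                                for j in range(jr, min(jr + nr, nc_eff)):
                                    C[ic + i][jc + j] += a * Bc[p][j]
    return C
```

The `min(...)` bounds handle problem sizes that are not multiples of the block sizes; how BLIS itself treats such edge cases is not covered by this sketch.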

The parallelization of GEMM in BLIS is analyzed in [18] for
conventional multi-threaded processors and in [17] for extreme
many-threaded architectures such as the IBM PowerPC A2
(16 cores/64 threads) and the Intel Xeon Phi (60 cores/240
threads). Basically, in both types of architectures, the
parallel implementations exploit the concurrency available in
the nested five-loop organization of the matrix multiplication
algorithm at one or multiple levels (i.e., loops). In general,
the approach takes into account the cache organization of
the processor (e.g., the presence of multiple sockets, which
cache levels are shared/private, etc.), while discarding the
parallelization of loops that would incur race conditions
in the update of C as well as loops with too fine a granularity.
These analyses [18, 17] can be summarized as follows:

• Parallelization of Loop 5 (indexed by ir). With this
option, different threads execute different instances of
the micro-kernel. Furthermore, they access the same
column block Br (of nr columns) in the L1 cache. The
amount of parallelism in this case, ⌈mc/mr⌉, is limited as
mc is usually a few hundreds.

• Parallelization of Loop 4 (indexed by jr). Different
threads access the same block Ac, of dimension mc × kc,
in the L2 cache. The time spent in this loop amortizes
the cost of packing (moving) the block of Ac from main
memory into the L2 cache. The amount of parallelism,
⌈nc/nr⌉, is in general larger than in the previous case, as
nc is frequently in the order of several hundreds up to
a few thousands.

• Parallelization of Loop 3 (indexed by ic). Each thread
packs a different block Ac into the L2 cache and
executes a different instance of the macro-kernel. The
number of iterations of this loop is not limited by the
blocking sizes, but instead depends on the problem
dimension m. When m is less than the product of mc
and the degree of parallelization of the loop, the blocks
Ac will be smaller than the optimal dimension and
performance may suffer. When there is a shared L2 cache,
the size of the blocks Ac will have to be reduced by a
factor equal to the degree of parallelization of this loop.
However, reducing mc is equivalent to parallelizing the
first loop around the micro-kernel.

• Parallelization of Loop 2 (indexed by pc). This is not a
good option because multiple threads simultaneously
update the same parts of C, requiring a mechanism to
deal with race conditions.

• Parallelization of Loop 1 (indexed by jc). From a
data-sharing perspective, this option is equivalent to gaining
parallelism outside of BLIS. In any case, this
parallelization is reasonable on a multi-socket system where
each CPU has a separate LLC (last-level cache).
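The amount of parallelism exposed by each loop follows directly from the loop bounds in Figure 1. The sketch below counts the iterations per loop; the helper name `loop_iterations` is hypothetical, and the example values (mc = 176, kc = 368, nc = 4,096, mr = nr = 4, with m = n = k = 4,096) are the Cortex-A15 block sizes and the problem size reported later in the paper.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def loop_iterations(m, n, k, mc, nc, kc, mr, nr):
    """Number of iterations of each loop of Figure 1 (per enclosing iteration)."""
    return {
        "loop1_jc": ceil_div(n, nc),
        "loop2_pc": ceil_div(k, kc),            # parallelizing this loop races on C
        "loop3_ic": ceil_div(m, mc),
        "loop4_jr": ceil_div(min(nc, n), nr),   # ceil(nc / nr)
        "loop5_ir": ceil_div(min(mc, m), mr),   # ceil(mc / mr)
    }


counts = loop_iterations(4096, 4096, 4096, mc=176, nc=4096, kc=368, mr=4, nr=4)
print(counts)
```

With these values Loop 5 offers 44 independent micro-kernel instances and Loop 4 offers 1,024, which illustrates why Loop 4 is described as exposing more parallelism than Loop 5.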

To sum up, these are general guidelines to decide which loops
are theoretically good candidates to be parallelized in order
to fully exploit the cache hierarchy of a target architecture.
At a glance, the combination of loops to parallelize strongly
depends on which cache(s) are shared. Usually, Loop 1 (jc)
is a good candidate when the LLC is separated for each
CPU (e.g., a multi-socket platform with on-chip L3 cache);
Loop 3 (ic) should be parallelized when each core has its own
L2 cache; and Loops 4 and/or 5 (jr and ir, respectively) are
to be parallelized when the cores share the L2 cache.

3. MATRIX MULTIPLICATION ON AMPS 

The ODROID-XU3 contains a Samsung Exynos 5422 SoC
with an ARM Cortex-A15 quad-core processing cluster
(running at 1.6 GHz in our setup) and a Cortex-A7 quad-core
processing cluster (at 1.3 GHz). Both clusters access a
shared DDR3 RAM (2 Gbytes) via 128-bit coherent bus
interfaces. Each ARM core (either Cortex-A15 or Cortex-A7)
has a 32+32-Kbyte L1 (instruction+data) cache. The four
ARM Cortex-A15 cores share a 2-Mbyte L2 cache, while
the four ARM Cortex-A7 cores share a smaller 512-Kbyte
L2 cache; see Figure 2.


[Diagram: Exynos 5422 System-on-Chip]

Figure 2: Exynos 5422 block diagram.

In order to attain high performance, a preliminary step is
to determine the optimal block sizes (mc, kc, nc) for the
target architecture and precision (all our experiments use
IEEE 754 double-precision arithmetic). For this purpose, we
performed an empirical search on the Cortex-A15 cores,
detecting the optimal values at mc = 176 and kc = 368. In
this architecture, nc plays a minor role and is simply set to
nc = 4,096 (nc is usually related to the L3 cache, which is not
present on these ARM CPUs). The micro-kernel for this
architecture is hand-coded with mr = 4 and nr = 4. These
optimal values are used in this work for both the Cortex-A7
and the Cortex-A15 cores.
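As a rough sanity check of these block sizes against the cache capacities quoted above, the footprints of the packed block Ac (mc × kc) and of the panel Br (kc × nr) can be computed in double precision. The sketch below assumes that "Kbyte" and "Mbyte" denote 1,024 and 1,048,576 bytes and that the L1 data cache is 32 Kbytes; the helper names are hypothetical.

```python
KIB = 1024  # assumption: 1 Kbyte = 1024 bytes


def panel_bytes(rows, cols, elem_bytes=8):
    """Storage of a rows x cols panel of IEEE 754 doubles."""
    return rows * cols * elem_bytes


def occupancy(mc=176, kc=368, nr=4):
    """Footprint of Ac and Br relative to the Exynos 5422 cache sizes."""
    ac = panel_bytes(mc, kc)   # block kept in the L2 cache
    br = panel_bytes(kc, nr)   # panel kept in the L1 data cache
    return {
        "Ac_bytes": ac,
        "Br_bytes": br,
        "Ac_share_of_A15_L2": ac / (2048 * KIB),
        "Ac_share_of_A7_L2": ac / (512 * KIB),
        "Br_share_of_L1D": br / (32 * KIB),
    }


for name, value in occupancy().items():
    print(name, value)
```

Under these assumptions Ac takes 518,144 bytes, about 25% of the Cortex-A15 L2 cache but about 99% of the Cortex-A7 L2 cache, while Br takes 11,776 bytes, about 36% of an L1 data cache. The arithmetic alone does not establish how much performance the Cortex-A7 loses with the shared parameters; it only shows that the footprint relative to the L2 cache differs substantially between the two clusters.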

3.1 Mapping multi-threaded BLIS to AMPs 

BLIS allows the user to select, at run time, which (one or
more) of the five internal loops are parallelized. In particular,
if one of the loops is parallelized, a static partition and
mapping of loop iteration chunks to the OpenMP threads is
performed prior to the beginning of the loop.

Our asymmetric version of BLIS integrates the following
three new features, which modify the behavior of the
multi-threaded BLIS at run time in order to accommodate an AMP
architecture: i) a mechanism to create “slow” and “fast”
threads, which will be bound upon initialization of the
library to LITTLE (Cortex-A7) and big (Cortex-A15) cores;
ii) a mechanism to decide which one of the loops that are
parallelized needs to be partitioned and assigned to slow/fast
cores asymmetrically (thus, chunks assigned to threads will
no longer be of uniform size, but partitioned according to
the capabilities of each type of core); and iii) an interface
to specify the ratio of performance between LITTLE and
big cores, which will ultimately define the number of
iterations assigned to each thread/core. All these mechanisms
are currently controlled via environment variables, but the
development of an ad-hoc API is part of ongoing work.

For the target Exynos 5422 SoC, given the memory
organization of this big.LITTLE architecture (private L1 cache
per core, shared L2 cache per cluster, lack of L3 cache) and
the guidelines given for the parallelization of BLIS GEMM at
the end of Section 2, we chose the parallelization approach
explained next.

At a coarse grain, the computational workload of the
complete multiplication C += A · B is distributed among the
Cortex-A15 and Cortex-A7 clusters by parallelizing either
Loop 1 (jc) or 3 (ic). In order to preserve the optimal cache
parameters during the execution of GEMM, while attaining a
distribution of the workload proportional to the computational
power of the A15 vs A7 clusters, we assign a different
number of iterations of the parallelized loop to each cluster; see,
e.g., Figure 3. In particular, the ratio applied to distribute
the iteration space between the Cortex-A15 and Cortex-A7
for GEMM has been empirically determined to be 6:1.^2
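One way to picture such a coarse-grain distribution is to split the iteration space of the parallelized loop at a multiple of the block size, so that both clusters keep working on full-size blocks. The following is a minimal sketch under that assumption; `split_by_ratio` is a hypothetical helper, and the rounding policy is chosen for illustration rather than taken from the modified BLIS.

```python
def split_by_ratio(n_iters, block, big_weight=6, little_weight=1):
    """Split [0, n_iters) between the big and LITTLE clusters.

    The boundary is placed at a multiple of `block` (e.g. mc for Loop 3),
    with the number of blocks per cluster proportional to the weights.
    Returns ((big_start, big_end), (little_start, little_end)).
    """
    total = big_weight + little_weight
    n_blocks = -(-n_iters // block)                      # ceiling division
    # Nearest integer to n_blocks * big_weight / total, in integer arithmetic.
    big_blocks = (n_blocks * big_weight + total // 2) // total
    boundary = min(big_blocks * block, n_iters)
    return (0, boundary), (boundary, n_iters)


# Loop 3 for m = 4096 with mc = 176 and the 6:1 ratio.
big, little = split_by_ratio(4096, 176)
print(big, little)
```

For m = 4,096 and mc = 176 this assigns rows 0 to 3,695 (21 blocks) to the Cortex-A15 cluster and rows 3,696 to 4,095 to the Cortex-A7 cluster.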

At a finer grain, the execution of each macro-kernel Cc +=
Ac · Bc (see Figure 1) is partitioned among the cores of the
same type by parallelizing Loops 4 (jr), 5 (ir) or both; see,
e.g., Figure 4.
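The fine-grain distribution among the four cores of a cluster can be illustrated with a static, chunked assignment of loop iterations to threads. The sketch below mimics the round-robin semantics of an OpenMP `schedule(static, chunk)` clause; whether BLIS assigns its chunks in exactly this order is an assumption, and `static_chunks` is a hypothetical helper.

```python
def static_chunks(n_iters, n_threads, chunk):
    """Assign iterations 0..n_iters-1 to threads in round-robin chunks."""
    owners = [[] for _ in range(n_threads)]
    for start in range(0, n_iters, chunk):
        thread = (start // chunk) % n_threads
        owners[thread].extend(range(start, min(start + chunk, n_iters)))
    return owners


# 16 iterations over 4 cores with chunk sizes 2 and 4.
print(static_chunks(16, 4, 2))
print(static_chunks(16, 4, 4))
```

With a chunk size of 2, each core receives two separate pairs of iterations; with a chunk size of 4, each core receives one contiguous group of four.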


Figure 3: Workload distributions for the matrix
multiplication C += A · B between the A15 and A7
quad-core clusters. Top: parallelization of Loop 1
(jc); bottom: parallelization of Loop 3 (ic). In the
bottom plot, the small rectangles, delimited by the
fine lines, denote the operands of the macro-kernel
Cc += Ac · Bc.

4. EVALUATION OF PERFORMANCE AND 
ENERGY EFFICIENCY 

The goal of the performance and energy efficiency tests in 
this section is to carry out an experimental study of both 

^2 This ratio varies depending on the target architecture, core
operating frequency, and specific routine, so it should be
adjusted accordingly.

Figure 4: Workload distributions for the macro-kernel
multiplication Cc += Ac · Bc between four cores
of the same type (C0, C1, C2, C3). Top:
parallelization of Loop 4 (jr); bottom: parallelization of
Loop 5 (ir). In this example, the OpenMP chunk
size equals 2 in the first case and 4 in the second.


metrics comparing the original multi-threaded implementation
of GEMM in BLIS against our asymmetric-aware implementation.
In all tests, we ensure the cores run at their highest
frequency by setting the performance governor. Codes are
instrumented with the pmlib [1] framework, which collects
power consumption data corresponding to instantaneous power
readings from four independent sensors in the board (for the
Cortex-A7 cores, Cortex-A15 cores, DRAM and GPU), with
a sampling period of 200 ms.
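The metrics used in the remainder of this section can be derived from a run time and a series of power samples as sketched below. This is not the pmlib interface: the function names are hypothetical, the samples are assumed to be evenly spaced instantaneous readings in watts already summed over the four sensors, and the flop count of GEMM is taken as 2mnk.

```python
def gemm_flops(m, n, k):
    """Floating-point operations of C += A * B (one multiply and one add per term)."""
    return 2 * m * n * k


def average_power(samples_w):
    """Mean of evenly spaced instantaneous power readings, in watts."""
    return sum(samples_w) / len(samples_w)


def energy_joules(samples_w, period_s=0.2):
    """Energy estimate: each reading is held constant over one sampling period."""
    return sum(samples_w) * period_s


def gflops_per_watt(m, n, k, elapsed_s, samples_w):
    """Performance (GFLOPS) divided by average power (W)."""
    gflops = gemm_flops(m, n, k) / elapsed_s / 1e9
    return gflops / average_power(samples_w)


# Synthetic numbers, for illustration only: 2 s of run time at a constant 5 W.
samples = [5.0] * 10
print(gflops_per_watt(1000, 1000, 1000, 2.0, samples))  # GFLOPS/W
print(gemm_flops(1000, 1000, 1000) / energy_joules(samples) / 1e9)  # Gflop/J
```

Both printed quantities agree for this synthetic input, reflecting that GFLOPS/W is equivalent to billions of flops per Joule.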

The first round of experiments analyzes the performance
and energy behavior of the Cortex-A7 and the Cortex-A15
core types when working in isolation. For this purpose, we
execute a collection of GEMM kernels using one of the
fine-grain parallelization options described in Section 3.
Concretely, as the L2 cache is shared among the cores of a
cluster, we parallelize Loop 4 using 1, 2, 3 and 4 threads
(cores), with the performance and energy results in Figure 5.
These plots reveal that the Cortex-A15 cores clearly deliver
higher performance, with a rough increase of 2.5 GFLOPS per
core, attaining a peak performance of about 10.2 GFLOPS with
4 threads. For the Cortex-A7 cores, the performance peaks
at around 2.0 GFLOPS, also attained with 4 cores.
Regarding energy efficiency, the Cortex-A15 obtains the best
results in terms of GFLOPS/W. However, the benefits from
increasing the number of threads in this case are less
significant (0.055 GFLOPS/W per core) when compared with
those obtained with the Cortex-A7 cores (0.193 GFLOPS/W
per core). It is also worth emphasizing that the use of 4
Cortex-A7 cores is more energy-efficient than an alternative
that leverages a single Cortex-A15 core, though the overall
performance of the former is slightly worse.

The second round of experiments evaluates the performance
and energy efficiency of the asymmetric-aware port of BLIS
to the big.LITTLE architecture. For this purpose, we run a
collection of GEMM kernels, relying on a 2-way
parallelization to distribute iterations of Loop 3 (see Section 3), with



Figure 5: Performance (top) and energy efficiency 
(bottom) of the BLIS DGEMM using exclusively 
one type of core, for a varying number of threads. 


a ratio of 6:1, among the cores of the fast and slow
clusters, and taking advantage of the independent L2 cache per
cluster in this manner. For the fine-grain parallelization, 4
threads are leveraged in order to assign chunks of the
iteration space for Loop 4 to each core within the cluster.
Our experiments with different configurations revealed this
option to be the most efficient for the target big.LITTLE
architecture.

Figure 6 reports the results for this second evaluation. The
line labeled as “big.LITTLE (4+4 threads)” corresponds to
the asymmetric-aware implementation. The same GEMM
kernels were computed with BLIS using a symmetric
workload distribution (the iteration space is equally distributed
among the Cortex-A7 and Cortex-A15 cores), with the
results labelled as “A7+A15 (4+4 threads)” in the figure. For
comparison purposes, the performance and energy obtained
using exclusively four Cortex-A7 or four Cortex-A15 CPUs
are also added. Finally, the “ideal” line corresponds to the
sum of the peak performances of the configurations that use
four cores of each of the two types in isolation (i.e., the
performance of the four Cortex-A15 cores plus the performance
of the four Cortex-A7 cores).
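Using the approximate cluster peaks quoted earlier in this section (about 10.2 GFLOPS for four Cortex-A15 cores and about 2.0 GFLOPS for four Cortex-A7 cores), the "ideal" reference and the headroom it leaves over the Cortex-A15 cluster can be worked out directly. The variable names below are illustrative, and the inputs are the rounded figures from the text rather than the exact data behind Figure 6.

```python
peak_a15 = 10.2  # GFLOPS, four Cortex-A15 cores (approximate, from the text)
peak_a7 = 2.0    # GFLOPS, four Cortex-A7 cores (approximate, from the text)

ideal = peak_a15 + peak_a7               # sum of the isolated cluster peaks
max_gain = ideal / peak_a15 - 1.0        # best-case gain over the A15 cluster alone
cluster_ratio = peak_a15 / peak_a7       # relative speed of the two clusters

print(f"ideal = {ideal:.1f} GFLOPS")
print(f"maximum gain over 4 x A15 = {100 * max_gain:.1f}%")
print(f"A15/A7 cluster ratio = {cluster_ratio:.1f}")
```

With these rounded inputs the ideal is 12.2 GFLOPS, which bounds the gain over the four Cortex-A15 cores at roughly 19.6%. The ratio between the cluster peaks, about 5.1, is of the same order as the 6:1 distribution ratio, although the text only states that the latter was determined empirically.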

These performance results show that the AMP configuration
outperforms the peak performance of all other configurations
and is close to the ideal case. The increment compared to
the configuration that employs four Cortex-A15 cores for
the largest tested problem is close to 20%. The asymmetric
version does not outperform the original version for small
matrices though, as the chunks assigned to the big and
LITTLE cores are, in those cases, too small to exploit the
asymmetric architecture. In terms of energy efficiency, the AMP
configuration is as efficient as the symmetric setup using
exclusively four Cortex-A15 cores.

The symmetric workload distribution attains about 40% of
the highest performance that is observed when employing
only the Cortex-A15 cores. The reason is that, with the
symmetric workload distribution, thread scheduling is
delegated to the operating system or the OpenMP runtime,
using a homogeneous distribution of chunks. This causes a
severe load imbalance as the fast Cortex-A15 threads
finish processing their assigned chunk, and have to wait a long
time for the Cortex-A7 threads to complete their
assignment. The energy efficiency is also affected, and this
configuration achieves the worst energy efficiency.
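The effect of this load imbalance can be reproduced with a deliberately simple model in which every core processes a fixed share of the work at a constant rate and the run ends when the slowest core finishes. The per-core rates below are obtained by dividing the approximate cluster peaks (10.2 and 2.0 GFLOPS) by four, which is an assumption; the model ignores packing costs, memory contention and synchronization, and `static_split_gflops` is a hypothetical helper.

```python
def static_split_gflops(core_gflops, shares):
    """Aggregate rate of a static split: work is done when the slowest core ends.

    core_gflops[i] is the rate of core i; shares[i] is its relative share of work.
    """
    total = sum(shares)
    # Time each core needs for its fraction of one unit (1 Gflop) of work.
    finish = max((s / total) / g for s, g in zip(shares, core_gflops))
    return 1.0 / finish


a15, a7 = 10.2 / 4, 2.0 / 4            # assumed per-core rates in GFLOPS
cores = [a15] * 4 + [a7] * 4

symmetric = static_split_gflops(cores, [1] * 8)
asymmetric = static_split_gflops(cores, [6] * 4 + [1] * 4)  # 6:1 between clusters

print(f"symmetric split:  {symmetric:.1f} GFLOPS")
print(f"asymmetric split: {asymmetric:.1f} GFLOPS")
```

Under these assumptions the symmetric split yields 4.0 GFLOPS, roughly 39% of the 10.2 GFLOPS of the Cortex-A15 cluster, which is in line with the figure of about 40% reported above. The 6:1 split yields 11.9 GFLOPS in the same model.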


[Plots: "BLIS DGEMM performance on Exynos 5422" (top) and "BLIS DGEMM energy efficiency on Exynos 5422" (bottom)]

Figure 6: Performance (top) and energy efficiency
(bottom) of the BLIS DGEMM implementations using
a single as well as different types of cores.

Diving into details that explain the energy efficiency of our
implementations, Table 1 shows a breakdown of power/energy
per component of the SoC, for a particular problem size:
m = n = k = 4,096. This table shows the (average) power
consumption and energy efficiency when employing i) from 1
to 4 threads of a single cluster; ii) the AMP configuration
with all 4+4 cores; and iii) the symmetric configuration of
BLIS using all 4+4 cores. The first four columns report the
average power consumption gathered from the SoC sensors,
while the average power consumption of the entire SoC is in
the fifth column. The performance achieved by the different
configurations is reported in the sixth column and the
energy efficiency is displayed in the last one.


The first aspect to note is that, as expected, the Cortex-A15
cores dissipate more power than the Cortex-A7 cores.
Indeed, a single Cortex-A15 core roughly doubles the power
dissipation rate of four combined Cortex-A7 cores, and the
Cortex-A15 CPU in idle state consumes more power than
two Cortex-A7 cores in execution. A second issue is that the
memory (DRAM) and total power consumption of the AMP
and symmetric configurations are close to those obtained by
adding the corresponding values of the two CPU clusters in
isolation. An exception is the total power consumption with
the symmetric configuration, in which a significant decrease
is observed due to the Cortex-A15 cores completing their
share of the work much earlier than the Cortex-A7 cores.
This aspect strongly affects the energy efficiency of the
symmetric configuration, as the power consumption is three times
higher than that obtained with the entire Cortex-A7 cluster,
but the performance is only doubled. As expected, the AMP
configuration is the one that dissipates the most power,
as it fully utilizes all the available resources. On the other
hand, it also obtains the shortest execution time, yielding
the best energy-to-solution.

5. CONCLUSIONS 

In this paper, we have proposed several mechanisms to map
the high-performance multi-threaded implementation of the
matrix multiplication in the BLIS library to an asymmetric
ARM big.LITTLE (Cortex A15+A7) SoC. Our results
reveal excellent improvements in performance compared with
a homogeneous implementation that operates exclusively on
one type of core (either A15 or A7), and also with respect
to multi-threaded implementations that rely on a symmetric
work distribution and delegate scheduling to the operating
system.

This is the first step towards a full BLAS implementation
optimized for big.LITTLE architectures, which is the ultimate
goal of our work. We believe that the approach applied to
GEMM carries over to the rest of the BLAS. However, there
are still a number of issues that need to be addressed to
further increase performance and adaptation to the
architecture. Among those, the most significant ones are the
integration of different micro-kernels and block sizes tuned to each
type of core in order to extract the maximum performance,
and the dynamic distribution and mapping of the workload
to each type of core transparently to the programmer. A
port to a 64-bit ARMv8 architecture and an experimental
study on architectures with different numbers of
big/LITTLE cores are also key milestones in our roadmap.